In this study, we will explore different ways that social relationships can be clustered in order to find categories of relationships. To find relationship categories, we will use a two-step process:

1. Use a dimension-reduction technique to project the relationships into a two-dimensional space that may reveal clusters of relationships.
2. Apply a clustering algorithm to the output of step 1 to assign relationships to clusters.
There are many different dimension-reduction techniques and clustering algorithms, and each step involves parameter choices that can affect the final output. We will explore as many of these choices as possible.
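The two-step process above can be sketched as follows. This is a minimal illustration, assuming Python with scikit-learn; random numbers stand in for the actual 159 × 30 ratings matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical stand-in for the real data: 159 relationships x 30 rated dimensions.
rng = np.random.default_rng(0)
ratings = rng.normal(size=(159, 30))

# Step 1: reduce the ratings to a two-dimensional space.
embedding = PCA(n_components=2).fit_transform(ratings)

# Step 2: assign each relationship to a cluster in that space.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embedding)
```

Any dimension-reduction technique (PCA, UMAP, t-SNE) and any clustering algorithm (k-means, hierarchical) can be slotted into the two steps.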
In Study 3, we used PCA as a dimensionality-reduction technique to find the overarching dimensions that can represent social relationship knowledge. In the following analyses, we will try to find categories of social relationships using data-driven methods.
By visualizing the PCA relationship plots, we may be able to see some clusters that are present. Here, we will cluster the relationships using each relationship's scores on the first four components.
We calculated the optimal number of clusters using silhouette scores for each cluster solution.
The recommended number of clusters is five for both k-means and hierarchical clustering.
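The silhouette-based selection of the number of clusters can be sketched as follows (a scikit-learn sketch; the four-component scores are simulated here, so the selected k is illustrative only):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the PCA scores: 159 relationships x 4 components.
rng = np.random.default_rng(0)
scores = rng.normal(size=(159, 4))

# Average silhouette width for each candidate cluster solution; higher is better.
sil = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scores)
    sil[k] = silhouette_score(scores, labels)

best_k = max(sil, key=sil.get)
```

The same loop works for hierarchical clustering by swapping in `AgglomerativeClustering(n_clusters=k)`.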
Five-cluster solution (k-means)
Five-cluster solution (hierarchical clustering)
Uniform manifold approximation and projection (UMAP) is another dimension-reduction technique. Whereas t-SNE (discussed in the supplementary materials) performs dimension reduction while aiming to retain the local structure of the data, UMAP aims to retain some of the global structure as well.
There are three parameters to consider when using UMAP: the number of nearest neighbors, the minimum distance between embedded points, and the distance function.
For the distance function, we will use the most popular choice in the field, Euclidean distance. For the other two parameters, we will explore the results of varying them.
More information on UMAP can be found here:
The nearest-neighbor parameter is the most important, as it determines the amount of local versus global information to retain. A lower value will capture more local structure and therefore produce results more similar to t-SNE; a higher value will capture more global structure.
We will explore UMAP results across a range of parameters. It is important to note that an appropriate nearest-neighbor value depends on the size of the data, so we will use values from a small one (2) up to a large one of about a quarter of the data (40).
The columns and rows of the above figure indicate the values of the nearest-neighbor and minimum-distance parameters, respectively. The advantage of UMAP over t-SNE is that it can retain more of the global structure of the original data, so we will use a moderate nearest-neighbor value. We also want to be able to cluster relationships, so we should use a smaller minimum-distance value. Therefore, we will use a nearest-neighbor value of 5 and a minimum distance of 0.1 for the clustering analysis.
Both k-means and hierarchical clustering indicate that two clusters would be optimal. However, this seems unintuitive, so we will go with the second-best data-driven solution, six clusters.
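Assigning the six clusters on the embedding could look like this (a scikit-learn sketch; the two-dimensional embedding is simulated here):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# Hypothetical stand-in for the 159 x 2 UMAP embedding.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(159, 2))

# Six-cluster solutions from both algorithms, for comparison.
km_labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(embedding)
hc_labels = AgglomerativeClustering(n_clusters=6).fit_predict(embedding)
```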
t-SNE is a dimensionality reduction technique. Whereas PCA attempts to organize data to explain the most variance and capture the most global structure, t-SNE focuses on capturing the local structure of the data. In this analysis, we will run t-SNE on the original data of 159 relationships rated on 30 dimensions. We will then use clustering algorithms to label the results and provide us with a categorization of the social relationships.
Perplexity is a tunable parameter of t-SNE that controls how the algorithm balances the local versus global structure of the data. Perplexity values usually lie between 5 and 50, where a low perplexity emphasizes more local structure and a higher perplexity emphasizes more global structure. More information on perplexity and t-SNE can be found on the creator’s official webpage.
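A perplexity sweep can be sketched as follows (using scikit-learn's t-SNE, with simulated ratings standing in for the 159 × 30 matrix):

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for the 159 x 30 ratings matrix.
rng = np.random.default_rng(0)
ratings = rng.normal(size=(159, 30))

# One two-dimensional embedding per candidate perplexity value.
embeddings = {}
for perplexity in (5, 15, 30, 50):
    embeddings[perplexity] = TSNE(
        n_components=2, perplexity=perplexity, random_state=0
    ).fit_transform(ratings)
```

Plotting each embedding side by side shows how lower perplexities break the data into tighter local clusters.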
It seems that a perplexity of 5 gives us distinct clusters. Since we are interested in creating relationship categories, we can ignore the more global structure of the data, which PCA captures better, and focus on creating distinct clusters of relationships.
The recommended number of clusters is six for both k-means and hierarchical clustering.
Social relationship ratings
This is a matrix of the average ratings of 159 social relationships on 30 dimensions drawn from the literature. These data were collected as part of Study 3.